Day28｜The Lifeline Revealed! See System Health at a Glance with Grafana Dashboards

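2025 iThome Ironman Contest

DAY 28

AI & Data

Thesis Wanderings: Exploring Tools, Assembling Workflows, and Taking On a Full Platform with AI (series part 29)

17th Ironman Contest

陳昱安

Team: 等待阿毛參賽中

2025-10-05 07:23:46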


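Okay, up front: today's post is about Grafana. Yes, you heard that right, the magic tool that turns all kinds of metrics into colorful lines and charts.
And, you know, as engineers we don't just stare at numbers; more often we're watching our own life curves: CPU at 99%, inner anxiety at 100%, coffee left in the cup at 0%.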

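1️⃣ Why Grafana?

Alright, back to business. Back on Day27 we installed Blackbox Exporter, so we know whether each API is alive.
But staring at a pile of raw Prometheus metrics feels like reading a bank statement: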

probe_success{target="note-api"} 1
probe_duration_seconds{target="note-api"} 0.032
probe_http_status_code{target="note-api"} 200

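Readable, sure, but honestly, doing this long-term will drive you crazy.
So we need Grafana to turn these cold numbers into a clean, intuitive Dashboard.
It's like skimming the news over your morning coffee instead of digging through every news source yourself.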
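To make the "bank statement" problem concrete, here is a minimal Python sketch that reads the three sample lines above and collapses them into one readable line per target. It only handles simple `name{labels} value` lines without timestamps or escaped quotes, and the function names `parse_samples` and `summarize` are my own, not part of any Prometheus library.

```python
import re

# Matches lines like: probe_success{target="note-api"} 1
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # anything more exotic is ignored in this sketch
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def summarize(samples):
    """Collapse probe samples into one human-readable line per target."""
    by_target = {}
    for name, labels, value in samples:
        by_target.setdefault(labels.get("target", "unknown"), {})[name] = value
    lines = []
    for target, metrics in sorted(by_target.items()):
        state = "UP" if metrics.get("probe_success") == 1 else "DOWN"
        latency_ms = metrics.get("probe_duration_seconds", 0.0) * 1000
        code = int(metrics.get("probe_http_status_code", 0))
        lines.append(f"{target}: {state}, HTTP {code}, {latency_ms:.0f} ms")
    return lines


raw = '''
probe_success{target="note-api"} 1
probe_duration_seconds{target="note-api"} 0.032
probe_http_status_code{target="note-api"} 200
'''
print("\n".join(summarize(parse_samples(raw))))
```

That one summary line is roughly what a Stat panel gives you for free, which is the whole point of putting Grafana in front of the raw numbers.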

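2️⃣ Deploying Grafana + Prometheus with Docker Compose

Let's look at how we deploy it, with a bit of grumbling along the way:
Docker Compose is like packing the whole monitoring ecosystem into one bento box. Easy to carry around, although you can be carrying it around until the small hours and still get stuck on Docker networking 🤦‍♂️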

networks:
  monitor-net:
    external: true
  langfuse-otel-net:
    external: true

services:
  note-db:
    image: postgres:15
    container_name: note-db
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "5411:5432"
    volumes:
      - ./data/note_db:/var/lib/postgresql/data
      - ./db/init-scripts:/docker-entrypoint-initdb.d:ro
    networks:
      - langfuse-otel-net
      - monitor-net
  prometheus:
    image: prom/prometheus
    user: "1001"
    volumes:
      - ./monitor/prometheus:/etc/config
      - ./monitor_data/prometheus_data:/prometheus
      - ./monitor/rules:/etc/prometheus/rules
    ports:
      - "127.0.0.1:9090:9090"
    command:
      - "--config.file=/etc/config/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
      - "--web.console.templates=/usr/share/prometheus/consoles"
    networks:
      - monitor-net
      - langfuse-otel-net
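      - app-net # not declared in the top-level networks block above; Compose rejects an undeclared network, so declare it there too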

  grafana:
    image: grafana/grafana:latest
    user: "472"
    ports:
      - "0.0.0.0:3002:3000"
    volumes:
      - ./monitor/dashboards:/var/lib/grafana/dashboards # dashboard JSON; Grafana.com dashboard IDs: Node Exporter Full 1860, cAdvisor Exporter 14282, Prometheus Blackbox Exporter 7587
      - ./monitor/provisioning:/etc/grafana/provisioning # provisioning config
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GF_SECURITY_ADMIN_PASSWORD:-admin}
      GF_SECURITY_ADMIN_USER: ${GF_SECURITY_ADMIN_USER:-admin}
      GF_SECURITY_SECRET_KEY: ${GRAFANA_SECRET_KEY:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_USERS_ALLOW_ORG_CREATE: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_SECURITY_ALLOW_EMBEDDING: "true"
    restart: always
    networks:
      - monitor-net

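Honestly, the most annoying part here is a misconfigured network: you stare at a blank Dashboard, silently asking yourself "what part of my life am I even monitoring?"

3️⃣ Datasource configuration

Grafana is a display tool; it won't go fetch anything from a DB or Prometheus on its own. You have to tell it where to pull data from.
That's what a Datasource is. Fine, I admit it: the first time I set this up, I almost thought I had to write SQL just to look at a metric 🤦‍♂️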
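One network mistake can be caught before anything starts: a service joining a network that the file never declares. The sketch below checks that on a plain Python dict shaped like a Compose file. The dict is a trimmed, hand-copied version of the file above, and `undeclared_networks` is a hypothetical helper name, not a Docker tool.

```python
def undeclared_networks(compose):
    """Return {service: [networks it uses that are not declared at top level]}."""
    declared = set(compose.get("networks") or {})
    problems = {}
    for name, service in (compose.get("services") or {}).items():
        # Compose allows list or mapping syntax here; list() yields the names for both.
        used = list(service.get("networks") or [])
        missing = sorted(net for net in used if net not in declared)
        if missing:
            problems[name] = missing
    return problems


# Trimmed copy of the Compose file above, keeping only the network-related keys.
compose = {
    "networks": {
        "monitor-net": {"external": True},
        "langfuse-otel-net": {"external": True},
    },
    "services": {
        "note-db": {"networks": ["langfuse-otel-net", "monitor-net"]},
        "prometheus": {"networks": ["monitor-net", "langfuse-otel-net", "app-net"]},
        "grafana": {"networks": ["monitor-net"]},
    },
}
print(undeclared_networks(compose))  # {'prometheus': ['app-net']}
```

It only covers undeclared names. A network that is declared but not shared between two services still produces the blank Dashboard, so it's also worth checking that Grafana and Prometheus have at least one network in common (here, monitor-net).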

PostgreSQL

apiVersion: 1
datasources:
  - name: postgresql
    type: postgres
    access: proxy
    url: note-db:5432
    isDefault: false
    editable: true
    user: user
    database: note
    jsonData:
      sslmode: disable
      postgresVersion: 1300
    secureJsonData:
      password: password

Prometheus

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      defaultDatabase: default
      tlsAuth: false
      tlsAuthWithCACert: false

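Watch out here: if the URL or port is wrong, the Dashboard ends up like a breakup: nothing but blank space, and no data comes through.

4️⃣ Dashboard provisioning

Next, if you want Dashboards loaded automatically, use provisioning so you don't have to import them by hand every time. Honestly, I'd rather go pour coffee.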
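When the Dashboard is blank, one way to tell "Grafana is misconfigured" from "Prometheus has no data" is to ask Prometheus directly through its HTTP API, which serves instant queries at `/api/v1/query`. The sketch below only builds the URL and parses a response, so it runs without a network. The helper names and the sample payload are made up for illustration, though the payload follows the documented response shape.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Turn an instant-query JSON response into {job label: float value}."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    values = {}
    for series in body["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        values[series["metric"].get("job", "")] = float(value)
    return values


# Hand-written sample response for the query `up`.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "noteserver"}, "value": [1759620000, "1"]},
            {"metric": {"__name__": "up", "job": "app_metrics"}, "value": [1759620000, "0"]},
        ],
    },
})
print(build_query_url("http://prometheus:9090", "up"))
print(parse_instant_vector(sample))
```

The hostname `prometheus` only resolves inside the Docker network, which is exactly what Grafana sees. From the host, the Compose file above publishes the same API on 127.0.0.1:9090, so fetching the URL with `urllib.request.urlopen` against that address should work there.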

apiVersion: 1

providers:
  - name: 'All Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards

  - name: 'System Dashboards'
    orgId: 1
    folder: 'System Metrics'
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards/system

  - name: 'App Dashboards'
    orgId: 1
    folder: 'App Metrics'
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards/app

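The first time I set this up, it took me until 2 a.m. to notice that a dashboard JSON file was missing a comma… Such is life: even a comma pays more attention to detail than I do.

5️⃣ Project structure

If you're like me and scatter files all over the project, finding a Dashboard JSON is harder than finding a girlfriend.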
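A missing comma is cheap to catch before Grafana ever sees the file. Here is a small sketch that walks a dashboards directory and reports every JSON file that fails to parse, with the line and column of the first error. `find_broken_dashboards` is a hypothetical helper, and the `./monitor/dashboards` path in the usage line is taken from the Compose file above.

```python
import json
from pathlib import Path


def find_broken_dashboards(root):
    """Return [(path, message)] for every *.json under root that fails to parse."""
    broken = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            broken.append((str(path), f"line {exc.lineno}, column {exc.colno}: {exc.msg}"))
    return broken


if __name__ == "__main__":
    for path, message in find_broken_dashboards("./monitor/dashboards"):
        print(f"{path}: {message}")
```

This only checks that the files are valid JSON. It says nothing about whether Grafana accepts them as dashboards, but it would have saved me that 2 a.m.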

.
├── alertmanager
│   └── alertmanager.yml
├── blackbox
│   └── blackbox.yml
├── dashboards
│   ├── app
│   │   ├── dashboard_apiGateway.json
│   │   ├── dashboard_noteservice.json
│   │   └── dashboard_redis_db_calling.json
│   ├── dashboard_data_analystics.json
│   ├── system
│   │   ├── blackbox_exporter.json
│   │   ├── cadvisor.json
│   │   └── node_exporter_full.json
│   └── top-level.json
├── prometheus
│   └── prometheus.yml
├── provisioning
│   ├── dashboards
│   │   └── dashboards.yml
│   └── datasources
│       ├── postgresql.yml
│       └── prometheus.yml
├── readme.md
└── rules
    └── rule.yml

6️⃣ How do you read the Dashboards?

Start by using dashboards other people have already built.

Node Exporter

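Analogy: like taking your blood pressure and pulse in the morning.
Metrics: CPU, Memory, Disk, Network.
Suggested panels:
- Chart / Graph → CPU and Memory trends
- Gauge / Stat → Disk usage, Network traffic
Life analogy: when your mood drops, have a coffee; when CPU spikes, restart the container.
💭: CPU at 99% is like three coffees first thing in the morning: your heart is about to jump out, but you still have to look calm.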

cAdvisor

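Analogy: watching container status is like watching how efficiently your coworkers work.
Metrics: container CPU, Memory, IO.
Suggested panels:
- Graph → per-container CPU/Memory usage trends
- Table → current status of each container
Observation: some containers are like procrastinating coworkers: they eat resources and produce nothing.
💭: Watching one container hog 90% CPU, I quietly think, "you're actually going to deliver today, right?"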

Blackbox Exporter

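Analogy: checking API health from the outside.
Metrics:
- probe_success → 1 on success / 0 on failure
- probe_duration_seconds → latency
- probe_http_status_code → HTTP status code
Suggested panels:
- Stat → probe_success, to tell at a glance whether it's alive
- Graph → probe_duration_seconds trend
- Table → HTTP status code breakdown
Workplace analogy: you think the service is alive, but the customer reports "it's completely dead"… well, life's like that too.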

FastAPI

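Analogy: an internal heartbeat check, like rating your own mood.
Metrics:
- Request count / Latency / Error rate
- Custom heartbeat metric
Suggested panels:
- Graph → API latency trend
- Stat / Gauge → success/failure ratio
- Table → current request volume per endpoint
💭: A normal heartbeat doesn't mean a person is happy, and a normal API heartbeat doesn't mean there are no bugs, but at least you know it's still breathing.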

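Prometheus recap

Okay, now that we've covered what the Grafana Dashboards look like, let's recap how Prometheus collects the data, because without data even the prettiest Dashboard is just a bunch of colored boxes.

Here is the current prometheus.yml, tidied up and annotated with which parts produce Node Exporter, cAdvisor, Blackbox, and FastAPI metrics.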

global:
  scrape_interval: 15s  # scrape metrics every 15 seconds

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"

rule_files:
  - "/etc/prometheus/rules/*.yml"

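These are the global settings. They tell Prometheus:

scrape_interval → scrape metrics every 15 seconds, basically like checking whether your coffee cup is empty when you wake up.
alertmanager → who receives the alerts
rule_files → how to decide when to alert

Host / Container-level metrics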

scrape_configs:
  - job_name: 'app_metrics'
    static_configs:
      - targets:
          - 'node-exporter:9100'
          - 'cadvisor:8080'

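node-exporter:9100 → Node Exporter
- CPU, Memory, Disk, Network… the host's vital signs
- 💭 Grumble: CPU at 99%, and I quietly wonder, "which of us blows up first?"
cadvisor:8080 → cAdvisor
- Container-level resource monitoring
- 💭 Grumble: containers are like procrastinating coworkers: they eat resources and produce nothing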

FastAPI / Application metrics

  - job_name: 'noteserver'
    static_configs:
      - targets: ['noteserver:8000']

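No metrics_path is set, so Prometheus scrapes the default /metrics path to collect the FastAPI metrics
Common metrics: Request count, Latency, Error rate, custom counters (such as a heartbeat metric)
💭 Grumble: like glancing at your coffee cup when you wake up, you know the system is still alive today

Blackbox Exporter: health checks from the outside

HTTP probe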
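What Prometheus gets back from /metrics is plain text in the exposition format: a `# HELP` line, a `# TYPE` line, then one sample per line. The sketch below renders that format with the standard library only, so you can see what the scrape looks like. In a real FastAPI service a client library would normally produce this for you. The metric names and label values are hypothetical, and label escaping is not handled.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics for a note service: a request counter and a heartbeat gauge.
body = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"path": "/notes", "status": "200"}, 3), ({"path": "/notes", "status": "500"}, 1)],
)
body += render_metric("note_heartbeat", "1 while the service loop is running.", "gauge", [({}, 1)])
print(body, end="")
```

Request count and error rate both fall out of the counter above once Prometheus applies `rate()` over time, which is why a plain counter with a status label goes a long way.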

  - job_name: 'blackbox-http-health'
    metrics_path: /probe
    params:
      module: [http_2xx_health]
    # static_configs (the targets listed below) and relabel_configs are not shown in this excerpt

Targets: noteserver, apiGateway, open-webui, note-storage, ollama, note-qdrant, langfuse-web, grafana, prometheus
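Metrics:
- probe_success → 1/0
- probe_duration_seconds → latency
- probe_http_status_code → HTTP status code
💭 Grumble: probe_success = 0 with probe_http_status_code = 500 (assuming the module only accepts 2xx) → the API is alive, but rolling its eyes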
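Reading probe_success and probe_http_status_code together tells you more than either one alone. The sketch below spells out the combinations, assuming the http_2xx_health module accepts only 2xx responses (its definition isn't shown here) and that the status code stays 0 when no HTTP response came back. `describe_probe` is a hypothetical helper, not something Blackbox Exporter provides.

```python
def describe_probe(probe_success, status_code):
    """Combine probe_success and probe_http_status_code into one verdict."""
    if probe_success == 1:
        if status_code >= 400:
            # Only possible when the module is configured to accept error codes.
            return f"passing, but only because the module accepts HTTP {status_code}"
        return "healthy"
    if status_code == 0:
        # No HTTP response at all: DNS failure, refused connection, timeout, ...
        return "unreachable (no HTTP response)"
    if status_code >= 500:
        return f"alive but failing (HTTP {status_code})"
    return f"alive but rejected by the module (HTTP {status_code})"


for success, code in [(1, 200), (0, 500), (0, 404), (0, 0)]:
    print(success, code, "->", describe_probe(success, code))
```

In Grafana the same idea works as a Table panel with both metrics side by side: the Stat panel says "down", the status code column says why.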

TCP probe

  - job_name: 'blackbox-tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 'note-db:5432'
          - 'redis:6379'
          - 'note-qdrant:6333'
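          - 'note-storage:9000'
    # relabel_configs that route the scrape through the Blackbox Exporter are not shown in this excerpt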

Metrics:
- probe_success → 1/0
- probe_duration_seconds → TCP connection latency
💭 Grumble: like ringing the doorbell to see if anyone's home

Summary tables

Understanding the metric layers
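The doorbell check is simple enough to reproduce by hand, which helps when you want to confirm what the probe should be seeing. Below is a minimal sketch of a TCP connect check that returns the same two facts as the probe: success (1/0) and how long the attempt took. It is my own approximation of the idea, not how Blackbox Exporter is implemented.

```python
import socket
import time


def tcp_probe(host, port, timeout=5.0):
    """Return (success, duration_seconds) for a plain TCP connect attempt."""
    start = time.monotonic()
    try:
        # Opening and immediately closing the connection is the whole check.
        with socket.create_connection((host, port), timeout=timeout):
            success = 1
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        success = 0
    return success, time.monotonic() - start


if __name__ == "__main__":
    # Hostnames as in the job above; they only resolve inside the Docker network.
    for target in ["note-db:5432", "redis:6379", "note-qdrant:6333", "note-storage:9000"]:
        host, port = target.rsplit(":", 1)
        ok, seconds = tcp_probe(host, int(port), timeout=2.0)
        print(f"{target}: probe_success={ok} duration={seconds:.3f}s")
```

A successful connect only proves something is listening on the port. It doesn't prove PostgreSQL or Redis can answer a query, which is why the application-level metrics still matter.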

Job Name	Metric type	Layer	Analogy
app_metrics/node-exporter	Node Exporter metrics	Host / OS	Blood pressure, pulse, bathroom scale
app_metrics/cadvisor	cAdvisor metrics	Container	The procrastinating coworker's efficiency
api_gateway / noteserver	FastAPI metrics	Application	Heartbeat, API activity
blackbox-http-health	Blackbox HTTP probe	External API / Health	Poke the API from outside to see if it's alive
blackbox-tcp	Blackbox TCP probe	TCP Service / DB	Ring the doorbell to see if anyone's home

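💭:

Node Exporter, cAdvisor → look at the state of the machine and containers from the inside
Blackbox Exporter → observe from the outside, to make sure the system is actually usable
FastAPI metrics → the inner vitals, what the application itself reports about whether it's alive

7️⃣ Visualization tips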

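Panel type	Use case	Suggested for
Graph / Time series	Time-series metrics	CPU, Memory, API Latency
Stat / Gauge	A single metric whose state you want to read instantly	Probe success rate, heartbeat metric
Table	Current status of multiple targets	Blackbox Exporter success/failure list
Heatmap	High-dimensional, dense usage data across containers	Container resource usage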
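To show how the panel types in the table map onto a dashboard file, here is a rough sketch that generates a tiny dashboard JSON with one Stat panel and one Time series panel. The field names follow the dashboard JSON model as I understand it (`type`, `gridPos` on a 24-column grid, `targets` with `expr`). The uid, titles, and layout are made up, and a dashboard exported from Grafana carries many more fields than this.

```python
import json


def panel(title, panel_type, expr, x, y, w=12, h=8):
    """Build one panel dict; panel_type is e.g. 'stat', 'timeseries', 'table'."""
    return {
        "title": title,
        "type": panel_type,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},  # grid is 24 columns wide
        "targets": [{"refId": "A", "expr": expr}],
    }


dashboard = {
    "uid": "blackbox-sketch",  # hypothetical uid
    "title": "Blackbox sketch",
    "panels": [
        # Single value, read at a glance: is the API up?
        panel("API up?", "stat", "probe_success", x=0, y=0),
        # Trend over time: how long does the probe take?
        panel("Probe latency", "timeseries", "probe_duration_seconds", x=12, y=0),
    ],
}
print(json.dumps(dashboard, indent=2))
```

No datasource is set on the panels, on the assumption that they then fall back to the default one, which is Prometheus in the datasource config above. Saving the output as a .json file under the provisioned dashboards directory would be one way to try it out.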

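📝 Summary

What Grafana is for: turning the metrics from Prometheus, Blackbox, cAdvisor, and FastAPI into intuitive, colorful Dashboards so you can grasp system health quickly.
Deployment essentials: Docker Compose ties together Grafana, Prometheus, and the DB; get the network configuration right or you'll be looking at a blank Dashboard.
Datasource: Grafana has to point at Prometheus or PostgreSQL correctly; with the wrong URL or port, no data comes through.
Dashboard Provisioning: loads the JSON Dashboards automatically, so no more manual imports.
Understanding the metric layers:
- Node Exporter → Host / OS layer: CPU, Memory, Disk, Network
- cAdvisor → Container layer: container resource usage
- FastAPI → Application layer: Request Count, Latency, Error rate
- Blackbox Exporter → External / TCP layer: confirms API or service availability from the outside
Suggested panels:
- Graph → trend metrics
- Stat/Gauge → a single metric at a glance
- Table → current status of multiple targets
- Heatmap → high-dimensional, dense resource usage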